THE COMMENTED-OUT CODE IS ONLY THERE TO SPEED UP KNITTING!! (with cache = TRUE the code is not re-run, but the markdown still has to compile the graphs, which takes time at every knit…)

names(cred) <- tolower(names(cred)) # converting the variable names to lowercase
cred <- cred[,-1] # deleting the first (useless) column
sum(is.na(cred)) # checking for missing data

Introduction

Topics to cover: - method used - main steps - classification task (goal) ≠ prediction - [to be completed…]

–> Data exploration = first insight –> Modelling: train/test split + selection of the “best” tree (pruning)/SVM (tuning)/NNET (number of layers)/logistic regression/k-NN (choice of k)/etc. + detailed presentation of all of this –> Cross-validation with the best of each model to select the overall winner…

–> Description of the variables (to make them easier to keep track of)

Exploratory analysis

Explanatory variables

Overview

Before beginning any kind of analysis, we have to understand the data we are working with.

data.frame(variable = names(cred),
           classe = sapply(cred, typeof),
           first_values = sapply(cred, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
  kable(caption="Overview of our data") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1, width = "10em", border_right = T) %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "18em") %>%
  scroll_box(width = "70%", height = "200px")
Overview of our data
variable classe first_values
chk_acct integer 0, 1, 3, 0, 0, 3
duration integer 6, 48, 12, 42, 24, 36
history integer 4, 2, 4, 2, 3, 2
new_car integer 0, 0, 0, 0, 1, 0
used_car integer 0, 0, 0, 0, 0, 0
furniture integer 0, 0, 0, 1, 0, 0
radio.tv integer 1, 1, 0, 0, 0, 0
education integer 0, 0, 1, 0, 0, 1
retraining integer 0, 0, 0, 0, 0, 0
amount integer 1169, 5951, 2096, 7882, 4870, 9055
sav_acct integer 4, 0, 0, 0, 0, 4
employment integer 4, 2, 3, 3, 2, 2
install_rate integer 4, 2, 2, 2, 3, 2
male_div integer 0, 0, 0, 0, 0, 0
male_single integer 1, 0, 1, 1, 1, 1
male_mar_or_wid integer 0, 0, 0, 0, 0, 0
co.applicant integer 0, 0, 0, 0, 0, 0
guarantor integer 0, 0, 0, 1, 0, 0
present_resident integer 4, 2, 3, 4, 4, 4
real_estate integer 1, 1, 1, 0, 0, 0
prop_unkn_none integer 0, 0, 0, 0, 1, 1
age integer 67, 22, 49, 45, 53, 35
other_install integer 0, 0, 0, 0, 0, 0
rent integer 0, 0, 0, 0, 0, 0
own_res integer 1, 1, 1, 0, 0, 0
num_credits integer 2, 1, 1, 1, 2, 1
job integer 2, 2, 1, 2, 2, 1
num_dependents integer 1, 1, 2, 2, 2, 2
telephone integer 1, 0, 0, 0, 0, 1
foreign integer 0, 0, 0, 0, 0, 0
response integer 1, 0, 1, 1, 0, 1

 

As we can see, the data are coherent with the information that “the client” provided us. Most variables are binary or categorical, while only a few are numerical.

Beyond seeing the first values of our variables and their types, we also need to understand how they are distributed and how they relate to each other. Thanks to a correlation plot, we can see the correlation between each pair of variables, and especially their correlation with the response variable in which we are interested.
We see for instance that variables like chk_acct, duration, history, sav_acct or rent are highly correlated (positively or negatively) with our outcome variable and are likely to influence it in the models that we are going to fit. Others like present_resident or retraining should have a low impact.

plot_correlation(cred, type="all", title = "correlation graph")

In addition, we can examine the summary of each variable. The summary and frequency table of history are presented below as an example:

Summary of variable history
Value
Min. 0.000
1st Qu. 2.000
Median 2.000
Mean 2.545
3rd Qu. 4.000
Max. 4.000
Frequency table of history
Values Frequency
0 40
1 49
2 530
3 88
4 293

 
 

However, presenting such a summary for all variables would be long and tedious. It is better to represent these numbers visually. A boxplot is optimal to get all the important values for the numerical data, while a barplot gives us strong insights for the categorical data. Let’s look at the following graphs:

# for (i in 1:(length(cred)-1)) {
#   if (max(cred[, i]) < 5) { # small range: treat the variable as categorical
#     print(
#       ggplot(cred, aes(x = cred[, i])) + geom_bar(stat = "count", position = "dodge") +
#         ggtitle(str_c("Barplot of\n", colnames(cred)[i])) +
#         xlab(colnames(cred)[i]) +
#         ylab("Total") +
#         my_theme()
#     )
#   } else { # numerical variable: boxplot
#     print(
#       ggplot(cred, aes(y = cred[, i])) + geom_boxplot() +
#         ylab(colnames(cred)[i]) +
#         ggtitle(str_c("Boxplot of\n ", colnames(cred)[i])) +
#         my_theme() +
#         theme(axis.text.x = element_blank())
#     )
#   }
# }

Thanks to these graphs, we can better understand our data at a glance and will be able to refer to them when needed.

In addition, these graphs enable us to see that some data are not tidy. For instance, education should be a binary variable. However, we can see on the histogram of this variable that values of \(-1\) were recorded. We have the same problem for the binary variable guarantor, where a value of \(2\) is present.
We can also have strong suspicions that the variable age contains wrongly recorded data, as we can see an outlier with a value much bigger than 100.
We will have to confirm these first assumptions and to modify these dirty data in an appropriate way.

Problematic variables

Age

Let’s first look at our variable age. We assume that, generally, a person will never live more than a hundred years, let alone contract a credit at such an age. This is why the data with \(Age > 100\) are definitely wrongly recorded. We will therefore have to replace them in our database.
First, we have to find how many observations are potentially dirty according to our assumptions and to locate them in order to replace them.
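A minimal sketch of such a check, assuming `cred` is the data frame loaded above:

```r
# Count and locate the suspicious ages (values above 100)
table(cred$age > 100)   # how many observations exceed the threshold
which(cred$age > 100)   # row index of the suspicious observation(s)
```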

Number of instances with age > 100
Var1 Freq
FALSE 999
TRUE 1
Position of instance with age > 100
x
537

 
 

According to our results, we have one observation with \(age > 100\) that has to be replaced. It is instance 537 and its value is 125.

We can consider different options to replace this value. The first one could be to replace it by a value drawn at random within the plausible range (a value between 19 and 75, 75 being the second highest value after \(125\)).
However, according to the following histogram, the distribution of age (without the erroneous data) is unequal, with a concentration around small values (which is logical, as young people generally have less money than their elders and are therefore more likely to ask for credits).

It would therefore also be possible to replace it at random with probabilities proportional to the size of each class.
We prefer to opt for the median (equal to 33) to replace our problematic value, as it is more convenient.
Note that our problematic value must not be used when calculating the median.

cred$age[cred$age > 100] <- median(cred$age[cred$age <= 100]) # replace the outlier by the median of the valid ages

An alternative could have been to use the mean but, as we have no other really big outlier, both values are close to each other (\(mean = 35.596\) while \(median = 33\)).

Next, we also have to deal with our two categorical variables that contain a wrongly recorded value:
- one in education
- one in guarantor

They also have to be cleaned.

Education

The following is again the barplot of education.

Here, the likelihood that this wrongly recorded value should actually be \(0\) is clearly the highest. Therefore, each of the previously presented methods (using the mean, using the median, or even assigning it to a class at random) would, with high probability, end up assigning this instance the value \(Education = 0\).
We can confirm this first assumption with a frequency table:

Frequency table of education
-1 0 1
sample size 1 950 49
proportion 0.1 % 95 % 4.9 %
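Such a frequency table can be obtained with `table()` and `prop.table()`; a sketch, assuming `cred` is our data frame:

```r
# Sample sizes and proportions (in %) for each recorded education value
edu.tab <- table(cred$education)
edu.tab
round(100 * prop.table(edu.tab), 1)
```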

 

It is indeed more appropriate to replace our value by 0, as the probability of belonging to this class is close to 20 times bigger.

cred$education[cred$education == -1] <- 0

 

Guarantor

Concerning the variable guarantor, we can look at the frequency table and plot the barplot as well:

Frequency table of guarantor
0 1 2
sample size 948 51 1
proportion 94.8 % 5.1 % 0.1 %

Again, for the same reasons, it is preferable to replace the wrongly recorded value by 0.

cred$guarantor[cred$guarantor == 2] <- 0

present_resident

In addition, the variable present_resident is also problematic, as it doesn’t have the same range as the other categorical variables. Its range goes from 1 to 4, whereas it should go from 0 to 3 like the others. We can shift its values in order to have the same format everywhere.

cred$present_resident <- cred$present_resident - 1

Variable num_credits: there is no 0, which is not consistent…

These first steps have enabled us to better understand our explanatory variables and to clean the problematic ones.
We now have to focus in detail on the response variable on which the predictions will be made.

Response variable

As our final goal is to predict whether a customer should be classified as risky or not, we have to take a particular look at our response variable, which establishes whether an applicant presents a good or a bad risk.

Frequency table of our response variable
0 1
sample size 300 700
proportion 30 % 70 %

As we can see, if a random customer steps into the bank, the a priori probability that he presents a good risk is 70%.
Without any calculations, the bank therefore has a good chance of making the right decision when granting a credit.
However, the consequences can be really dramatic if 30% of the credits that the bank grants are not fully reimbursed. That’s why we have to develop a model with an accuracy much better than this initial 70% obtained with the naive method of always granting the credit.
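The naive baseline discussed above can be computed directly from the class proportions; a sketch, assuming `cred$response` holds the 0/1 outcome:

```r
resp.tab <- prop.table(table(cred$response))
resp.tab       # a priori proportions: 0.3 / 0.7
max(resp.tab)  # accuracy of always predicting the majority class: 0.7
```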

We will get to this later. First, after having presented each variable, it could be interesting to see if we can already form some assumptions concerning the relations between the explanatory variables and the response variable.

Interactions between the explanatory variables and the response variable

Text…

–> Explain that we can already form a first a priori view of the variables that will have an impact

Modelling

Before starting to work on our different models, we create a test set and a training set. This procedure is used in order to avoid overfitting, i.e. evaluating the models on instances that were used to build them. In order to evaluate the performance of the different models, we will have to use the exact same training and test sets for each model, to be sure that the performance differences result from the model we use and not from the randomness of splitting the sets differently.
A Cross-Validation will be performed at the end in order to compare the different models and to choose which one should be used to make good predictions.

set.seed(12345) 
index.train <- sample(1:nrow(cred), size=nrow(cred)*0.7, replace=FALSE)
cred.train <- cred[index.train,]
cred.test <- cred[-index.train,]

To evaluate our models, we will use the accuracy as our main measure of performance. Since computing the accuracy by hand for each model would be tedious, we build a function that retrieves it at any time from a confusion matrix:

# Creating a function to retrieve the accuracy from a confusion matrix
accuracy <- function(c){
  sum(diag(c))/sum(c)
}
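For instance, on a toy confusion matrix (hypothetical counts):

```r
conf <- matrix(c(40, 10, 20, 130), nrow = 2)  # diagonal = well-classified: 40 and 130
accuracy(conf)  # (40 + 130) / 200 = 0.85
```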

CART

** Short description of CART, … ** –> how it is computed (decrease of impurity criterion) –> iterative procedure, etc.

Building the model

cart.model <- rpart(response ~ .,  data = cred.train, method = "class")
rpart.plot(cart.model, main="Original decision tree")

–> At this point, we have obtained a decision tree with good prediction capabilities, but that is also really complex (a lot of splits and branches). To simplify this model without losing too much predictive capability, it can be useful to prune it, keeping only the most important splits linked to the most important variables.

Tuning the model: Pruning

As already said, our complex tree has to be pruned in order to reduce its complexity. To do so, we will use the 1-SE rule.
The idea of this rule is really general. Just as one would run a t-test in statistics to see if two measures are statistically different, the 1-SE rule tries to establish whether two models produce statistically different results (whether we can affirm that one outperforms the other).
We therefore consider the xerror (the criterion in which we are interested and that we want to minimize) and its standard deviation xstd. A model that falls within 1 standard deviation of the most complex model can be considered equivalent in terms of performance. Therefore, using this rule, we will be able to prune the tree to get a much simpler model without losing quality, as the performance of our new model will be statistically the same as that of the first one.
To select the size of our pruned tree, we will look at the xerror of our original tree (the most complex one) and add one standard error xstd. We will then select the simplest tree with an xerror that lies below the calculated value.

Let’s have a look at these values from our original tree and decide where to prune it.

CP table of the CART model
CP nsplit rel error xerror xstd
0.0480769 0 1.0000000 1.0000000 0.0581302
0.0336538 3 0.8413462 0.9423077 0.0571125
0.0224359 4 0.8076923 0.9615385 0.0574629
0.0144231 9 0.6826923 0.9423077 0.0571125
0.0120192 10 0.6682692 0.8942308 0.0561857
0.0100000 12 0.6442308 0.8701923 0.0556943
Variable importance table (first 6 variables)
x
chk_acct 34.568946
sav_acct 13.636361
duration 13.593226
amount 13.224203
history 9.711696
num_credits 6.728019

According to our first table, the xerror of the most complex tree is equal to 0.8701923, its xstd to 0.0556943, and the sum of both is therefore equal to 0.9258866.

The highest xerror below this value is 0.8942308 and we will therefore prune the tree at CP = 0.0120192. According to our previous table, this means keeping 10 splits, i.e. a tree of size 11.
It is also possible to observe this value on the graph, where the dashed line indicates the xerror plus the xstd of the most complex tree. We therefore select the least complex tree under this line, which is the tree of size 11 (or, again, with 10 splits).

We thus prune the tree at this value (the R prune() function requires the CP value to be indicated).
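A sketch of how `cp.pruned` can be derived from the CP table with the 1-SE rule, illustrated here on the values reported above (in the report, `cp.tab` would simply be `cart.model$cptable`):

```r
cp.tab <- cbind(
  CP     = c(0.0480769, 0.0336538, 0.0224359, 0.0144231, 0.0120192, 0.0100000),
  xerror = c(1.0000000, 0.9423077, 0.9615385, 0.9423077, 0.8942308, 0.8701923),
  xstd   = c(0.0581302, 0.0571125, 0.0574629, 0.0571125, 0.0561857, 0.0556943)
)
# threshold = xerror of the most complex tree + its standard error
threshold <- cp.tab[nrow(cp.tab), "xerror"] + cp.tab[nrow(cp.tab), "xstd"]
# simplest tree whose xerror lies below the threshold
cp.pruned <- cp.tab[min(which(cp.tab[, "xerror"] < threshold)), "CP"]
cp.pruned  # 0.0120192, i.e. the tree with 10 splits
```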

cart.pruned <- prune(cart.model, cp=cp.pruned)

We can visualize this new pruned tree, which is much smaller and therefore less complex than our original one. We will use it to make our predictions, build the confusion matrix and calculate the resulting accuracy.

Predicting the values of the testing set

pred.cart <- predict(cart.pruned, newdata = cred.test, type = "class")
cart.tab <- table(Predictions = pred.cart, Observations = cred.test$response) # confusion matrix.
cart.tab %>% kable(caption = "Confusion matrix for the CART model",
                   col.names = c("Predict bad", "Predict good")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F,
    position = "l") %>%
  column_spec(1, border_right = T, width = "5em") %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "6em")
Confusion matrix for the CART model
Predict bad Predict good
bad 48 37
good 44 171

Based on this table, we can calculate the accuracy using our previously built function, which computes the number of well-classified elements divided by the total number of elements.

# accuracy using our previously built function
accuracy(cart.tab) 
## [1] 0.73

It is possible to get more information, for instance the sum of each row, with the following cross table.


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
                   | pred.cart 
cred.test$response |       bad |      good | Row Total | 
-------------------|-----------|-----------|-----------|
               bad |        48 |        44 |        92 | 
                   |     0.522 |     0.478 |     0.307 | 
                   |     0.565 |     0.205 |           | 
                   |     0.160 |     0.147 |           | 
-------------------|-----------|-----------|-----------|
              good |        37 |       171 |       208 | 
                   |     0.178 |     0.822 |     0.693 | 
                   |     0.435 |     0.795 |           | 
                   |     0.123 |     0.570 |           | 
-------------------|-----------|-----------|-----------|
      Column Total |        85 |       215 |       300 | 
                   |     0.283 |     0.717 |           | 
-------------------|-----------|-----------|-----------|

 

We can see that 92 instances are observed as bad and 208 as good, while 85 are predicted as bad and 215 as good.

–> Discuss the accuracy itself (0.73 is not great compared to the baseline 0.70 obtained by always predicting “good”) –> Discuss the correct predictions of 0 (these are the ones we are interested in; always predicting 1 would drive the bank to bankruptcy) –> Discuss FALSE POSITIVE / FALSE NEGATIVE / TRUE POSITIVE / TRUE NEGATIVE

Logistic regression

–> Explain why logistic regression (does a good job when the outcome is binary)

Building the model

logit.model <- glm(response ~., data=cred.train, family="binomial")
summary(logit.model)

Call:
glm(formula = response ~ ., family = "binomial", data = cred.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6929  -0.7144   0.3723   0.7072   2.0701  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       7.093e-01  9.950e-01   0.713 0.475913    
chk_acct          6.022e-01  9.061e-02   6.646    3e-11 ***
duration         -3.216e-02  1.094e-02  -2.940 0.003282 ** 
history           4.180e-01  1.101e-01   3.798 0.000146 ***
new_car          -1.057e+00  4.489e-01  -2.356 0.018493 *  
used_car          7.227e-01  5.617e-01   1.287 0.198256    
furniture         2.955e-01  4.765e-01   0.620 0.535199    
radio.tv          7.336e-03  4.519e-01   0.016 0.987048    
education        -5.931e-01  6.097e-01  -0.973 0.330649    
retraining       -9.425e-02  5.331e-01  -0.177 0.859664    
amount           -9.716e-05  5.338e-05  -1.820 0.068716 .  
sav_acct          2.550e-01  7.715e-02   3.305 0.000950 ***
employment        1.358e-01  9.380e-02   1.448 0.147588    
install_rate     -3.683e-01  1.067e-01  -3.451 0.000559 ***
male_div         -2.584e-01  4.703e-01  -0.549 0.582728    
male_single       4.530e-01  2.482e-01   1.825 0.067938 .  
male_mar_or_wid   3.651e-01  3.871e-01   0.943 0.345590    
co.applicant      2.489e-01  4.982e-01   0.500 0.617311    
guarantor         9.220e-01  5.126e-01   1.799 0.072096 .  
present_resident -5.245e-02  1.024e-01  -0.512 0.608392    
real_estate       9.158e-02  2.508e-01   0.365 0.714955    
prop_unkn_none   -1.857e-01  4.608e-01  -0.403 0.686948    
age               3.489e-03  1.034e-02   0.337 0.735888    
other_install    -7.747e-01  2.489e-01  -3.113 0.001855 ** 
rent             -8.173e-01  5.637e-01  -1.450 0.147125    
own_res          -5.232e-02  5.319e-01  -0.098 0.921645    
num_credits      -2.139e-01  1.985e-01  -1.078 0.281172    
job               5.951e-02  1.720e-01   0.346 0.729398    
num_dependents    6.262e-02  3.050e-01   0.205 0.837334    
telephone         4.250e-01  2.388e-01   1.780 0.075133 .  
foreign           1.851e+00  7.313e-01   2.531 0.011370 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 851.79  on 699  degrees of freedom
Residual deviance: 623.21  on 669  degrees of freedom
AIC: 685.21

Number of Fisher Scoring iterations: 5

–> Mention the similarities with CART (the variables that were important there also have significant coefficients here… ** not all of them… ** –> nuance this). –> Mention that we also have plenty of useless variables and that we will run a stepwise selection to minimize the AIC –> Give the idea of the AIC (trade-off between model performance and complexity) –> Explain how it works

Tuning the model: Stepwise selection according to AIC criteria

logit.step <- step(logit.model, direction="both")
summary(logit.step)

–> Mention that we still keep a lot of variables but that the AIC is minimized –> Mention that we could remove some manually (but that is long and tedious and not necessarily useful)
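The effect of the selection can be checked by comparing the AIC of both fitted models; a sketch, using the base R function `AIC()` on the objects defined above:

```r
AIC(logit.model)  # 685.21 for the full model (see the summary above)
AIC(logit.step)   # lower, by construction of the stepwise procedure
```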

Predicting the values of the testing set

pred.logit.step <- ifelse(
  predict(logit.step, newdata=cred.test, type="response") > 0.5, "good", "bad"
  ) # the predictions of the class
pred.logit.step.prob <- predict(logit.step, newdata=cred.test, type="response")

data.frame(pred.logit.step.prob,cred.test$response) %>% 
  ggplot(aes(x=cred.test$response, y = pred.logit.step.prob, group=cred.test$response)) + geom_boxplot() +
        my_theme() +
  ggtitle("Predicted probabilities against real values") +
  xlab("Values of the testing set") +
  ylab("Predicted probabilities of the logistic regression") 

glm.tab <-
  table( Observations = cred.test$response, Predictions = pred.logit.step) # confusion matrix.
glm.tab %>% kable(caption = "Confusion matrix for the Logistic regression model",
                  col.names = c("Predict 0", "Predict 1")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F,
    position = "l"
  ) %>%
  column_spec(1, border_right = T, width = "5em") %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "6em")
Confusion matrix for the Logistic regression model
Predict 0 Predict 1
bad 45 47
good 30 178
accuracy(glm.tab) # accuracy
## [1] 0.7433333

FALSE POSITIVES: 47 of the 92 truly bad applicants are predicted 1 while they are actually 0… which is a problem. Idea: we can raise the probability threshold above which we call an applicant “good”, in order to get fewer FALSE POSITIVES. This will lower the accuracy, but it may cost us less money (we will refuse risky credits more often, but also some riskless ones…)

pred.logit_0.75 <- ifelse(
  predict(logit.step, newdata=cred.test, type="response") > 0.75, "good", "bad"
  ) # the class predictions with a 0.75 threshold
glm.tab <-
  table(Observations = cred.test$response, Predictions = pred.logit_0.75) # confusion matrix.
glm.tab %>% kable(caption = "Confusion matrix for the adapted Logistic regression model",
                  col.names = c("Predict bad", "Predict good")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F,
    position = "l"
  ) %>%
  column_spec(1, border_right = T, width = "5em") %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "6em")
Confusion matrix for the adapted Logistic regression model
Predict bad Predict good
bad 72 20
good 72 136
sum(diag(glm.tab))/sum(glm.tab) # accuracy
## [1] 0.6933333

–> By raising the threshold to 0.75, we keep a decent accuracy while making fewer evaluation errors among the “risky” applicants. It depends on what margin the bank can afford and whether risky credits necessarily lead to losses…
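To quantify that trade-off, we can compare the share of truly bad applicants wrongly predicted “good” under both thresholds, using the two confusion matrices reported above:

```r
# rows = observations, columns = predictions (values taken from the tables above)
tab.050 <- matrix(c(45, 30, 47, 178), nrow = 2,
                  dimnames = list(Obs = c("bad", "good"), Pred = c("bad", "good")))
tab.075 <- matrix(c(72, 72, 20, 136), nrow = 2,
                  dimnames = list(Obs = c("bad", "good"), Pred = c("bad", "good")))
tab.050["bad", "good"] / sum(tab.050["bad", ])  # threshold 0.50: 47/92 ≈ 0.51
tab.075["bad", "good"] / sum(tab.075["bad", ])  # threshold 0.75: 20/92 ≈ 0.22
```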

Neural Network

Building the model and optimizing it: selecting the number of neurones in the hidden layer

The “literature” (https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw) suggests keeping a single hidden layer with a number of neurones between 1 and the number of dimensions: we will test them all and keep the best one… “In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.”
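As an illustration, the rule of thumb quoted above would suggest, for our 30 explanatory variables and single output:

```r
n.in  <- 30  # number of explanatory variables in cred
n.out <- 1   # single binary output
round((n.in + n.out) / 2)  # suggested hidden-layer size: 16 neurones
```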

“Over-parametrization quickly leads to instability in the estimate and can lead to overfitting. One possibility is to impose a penalty on the largest weights during the optimization. This is called regularization and, more specifically in the context of neural networks, weight decay.”

–> We try to find the “optimal” number of neurones. Since a neural network uses random starting values to initialize the algorithm, the results vary. We therefore compute the accuracy associated with each number of neurones 10 times and see whether there is an optimal number.

USE DIFFERENT TRAIN AND TEST SETS AT EACH ITERATION!

acc.nnet.cv <- matrix(nrow=30,ncol=10) 
for (j in 1:10){
index.train1 <- sample(1:nrow(cred), size=nrow(cred)*0.7, replace=FALSE)
cred.train1 <- cred[index.train1,]
cred.test1 <- cred[-index.train1,]
nnet.model = list()
for (i in 11:30) {
  nnet.model[[i]] = nnet(response~., data = cred.train1, size=i, maxit=200, decay = 1)
}
pred.nnet=list()
for (i in 11:30){
  pred.nnet[[i]] <- predict(nnet.model[[i]], cred.test1, type="class")
} # making the predictions
  tab.nnet=list()
for (i in 11:30){
  tab.nnet[[i]] <- table(Reality=cred.test1$response, Predicted=unlist(pred.nnet[i]))
}
  acc.nnet = list()
for (i in 11:30) {
  acc.nnet[[i]] <-
    sum(ifelse(cred.test1$response == unlist(pred.nnet[i]), 1, 0), na.rm = TRUE) /
    length(cred.test1$response)

  acc.nnet.cv[i,j] <- acc.nnet[[i]]} 
}

We use different train and test sets at each iteration.

mean.acc.nnet.cv <- numeric(30)
sd.acc.nnet.cv <- numeric(30)
for(i in 11:30){mean.acc.nnet.cv[i] <- mean(acc.nnet.cv[i,])
sd.acc.nnet.cv[i] <- sd(acc.nnet.cv[i,])}
mean.acc.nnet.cv[-c(1:10)] # average accuracy for the models with i = 11,…,30 neurones
##  [1] 0.7646667 0.7640000 0.7666667 0.7653333 0.7600000 0.7626667 0.7680000
##  [8] 0.7683333 0.7650000 0.7663333 0.7646667 0.7666667 0.7633333 0.7693333
## [15] 0.7656667 0.7713333 0.7703333 0.7616667 0.7673333 0.7750000
accu <- data.frame(Average=mean.acc.nnet.cv[-c(1:10)], Neurones = c(11:30))

The best number of neurones and its accuracy can then be retrieved:

which.max(mean.acc.nnet.cv) # neurones leading to the model with the highest accuracy
[1] 30
mean.acc.nnet.cv[which.max(mean.acc.nnet.cv)] # the accuracy
[1] 0.775

The maximum accuracy is equal to 0.775.
On the following plot, we can see at a glance the accuracies and the model that we are going to select.

We can directly see that the model with 30 neurones presents the highest accuracy.

Obviously, this choice is open to discussion and other methods could be preferred. In the end, the differences between our different models are really small, meaning that they all have quite similar predictive capabilities.

We can therefore continue to explore further the model with 30 neurones that we retained.

TESTING DIFFERENT DECAYS

Note: we considered more decay values but, for computational efficiency, we will only display some of them…

acc.nnet.cv2 <- matrix(nrow=length(seq(2^-8,2,2^-3)),ncol=10) # we initially considered extremely small values like seq(0,2^-5,2^-8) or seq(0,2^-4,2^-10), as well as intermediate and high ones; the range proposed here best fits the data.
for (j in 1:10){
index.train2 <- sample(1:nrow(cred), size=nrow(cred)*0.7, replace=FALSE)
cred.train2 <- cred[index.train2,]
cred.test2 <- cred[-index.train2,]
nnet.model2 = list()
pred.nnet2=list()
tab.nnet2=list()
acc.nnet2 = list()
k <- 1
for (i in seq(2^-8,2,2^-3)) {
  nnet.model2[[k]] = nnet(response~., data = cred.train2, size=which.max(mean.acc.nnet.cv), maxit=200, decay = i)
pred.nnet2[[k]] <- predict(nnet.model2[[k]], cred.test2, type="class") # making the predictions
tab.nnet2[[k]] <- table(Reality=cred.test2$response, Predicted=unlist(pred.nnet2[k]))
acc.nnet2[[k]] <- sum(ifelse(cred.test2$response == unlist(pred.nnet2[k]), 1, 0), na.rm = TRUE) /
    length(cred.test2$response)
  acc.nnet.cv2[k,j] <- acc.nnet2[[k]]
  k <- k+1
  } 
}
mean.acc.nnet.cv2 <- numeric(16)
sd.acc.nnet.cv2 <- numeric(16)
for(i in 1:16){mean.acc.nnet.cv2[i] <- mean(acc.nnet.cv2[i,])
sd.acc.nnet.cv2[i] <- sd(acc.nnet.cv2[i,])}
mean.acc.nnet.cv2 # average accuracy for each tested decay value
##  [1] 0.7160000 0.7526667 0.7526667 0.7686667 0.7740000 0.7746667 0.7733333
##  [8] 0.7676667 0.7676667 0.7686667 0.7676667 0.7643333 0.7643333 0.7626667
## [15] 0.7673333 0.7666667
accu2 <- data.frame(Average=mean.acc.nnet.cv2, Decay = seq(2^-8,2,2^-3))
which.max(mean.acc.nnet.cv2) # index of the decay value leading to the highest accuracy
[1] 6
mean.acc.nnet.cv2[which.max(mean.acc.nnet.cv2)] # the accuracy
[1] 0.7746667
round(seq(2^-8,2,2^-3)[which.max(mean.acc.nnet.cv2)],1) # the retained decay
[1] 0.6
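A quick check of the decay grid used above and of the retained value:

```r
decays <- seq(2^-8, 2, 2^-3)  # the 16 candidate decay values
length(decays)       # 16
round(decays[6], 1)  # 0.6, the retained decay (index 6 maximized the accuracy)
```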

Model selection and predicting the values of the testing set

We now present the characteristics of the retained model, with 30 neurones in the hidden layer and a decay of 0.6. As usual, we train the model and predict the values of the test set in order to build the confusion matrix and calculate the accuracy.

nnet.model.retained <- nnet(response ~ ., data = cred.train, maxit = 200,
                            size = which.max(mean.acc.nnet.cv),
                            decay = round(seq(2^-8,2,2^-3)[which.max(mean.acc.nnet.cv2)],1))

pred.nnet.retained <- predict(nnet.model.retained, cred.test, type="class") # predictions on the test set


tab.nnet.retained <- table(Reality = cred.test$response, Predicted = unlist(pred.nnet.retained)) # confusion matrix

acc.nnet.retained <- sum(ifelse(cred.test$response == unlist(pred.nnet.retained), 1, 0), na.rm = TRUE) /
  length(cred.test$response) # accuracy

In addition, we can plot our neural network:

This graph shows that our model is quite complex, as there are plenty of arrows. It is not necessary to understand all of them precisely; the network can be seen as a “black box”, meaning that the interpretability of such a model is modest. But, after all, what we need is to make good predictions!

Confusion matrix of the fitted NNET model
Predict bad Predict good
bad 52 40
good 28 180

We obtain a final accuracy of 0.773 on our testing set.

Support Vector Machine (SVM)

Building the model

Optimizing the model: selecting good cost and gamma parameters

Predicting the values of the testing set

Model x (K-nn)

Building the model

scaled.cred <- scale(cred[,-length(cred)]) %>% data.frame(response=cred$response)
# scaling all columns except the last one (the response)

scaled.cred.train <- scale(cred.train[,-length(cred.train)]) %>% data.frame(response=cred.train$response)
scaled.cred.test <- scale(cred.test[,-length(cred.test)]) %>% data.frame(response=cred.test$response)
# Creation of 40 KNN models: 20 with k = {1,…,20} and distance 2, and 20 with k = {1,…,20} and distance 3
knn.model = list()
k <- 0
for (j in 2:3) {
  for (i in 1:20) {
    knn.model[[20 * k + i]] <- kknn(response ~ ., k=i, distance=j,train=scaled.cred.train, test=scaled.cred.test)
  }
  k <- k+1
}

# making the predictions on the testing set
pred.knn=list()
for (i in 1:length(knn.model)){
  pred.knn[[i]] = table(Prediction=knn.model[[i]]$fitted.value, Actual=cred.test$response)
}

# calculating the accuracy
acc.knn = list()
for (i in 1:length(knn.model)) {
  acc.knn[[i]] <- sum(diag(pred.knn[[i]]))/sum(pred.knn[[i]]) # well-classified / total
}
[1] "The selected knn model uses a distance of 2 and is computed with the 12 nearest neighbors. The obtained accuracy is equal to 0.74."
Confusion matrix of the fitted KNN model
Observed bad Observed good
Predicted bad 32 18
Predicted good 60 190
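Since the 40 models are stored in a single list (indices 1 to 20 for distance 2, 21 to 40 for distance 3), the selected (distance, k) pair can be recovered from the index of the highest accuracy; a sketch, assuming `acc.knn` is the list of accuracies built above:

```r
best      <- which.max(unlist(acc.knn))  # index of the best model
dist.best <- ifelse(best <= 20, 2, 3)    # first 20 entries use distance 2
k.best    <- ifelse(best <= 20, best, best - 20)
c(distance = dist.best, k = k.best)
```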

Small database, so the fact that k-NN is a “lazy learner” is not a problem.

Optimizing the model: selecting k and the distance parameter

Predicting the values of the testing set

That’s enough for now…

Cross Validation approach

–> Explain why we perform a cross-validation –>

Creation of 10 different sets in order to do the cross-validation. The number of instances being 1000, we will make 10 sets of size 100. They are stored in a list named test.list. At each step, the remaining part of the database is stored in the list train.list.

test.list <- list() # creates an empty list that will be the test sets
train.list <- list() # creates an empty list that will be the train sets
counter <- 0

# Creates the 10 test sets of size 100 (and the matching train sets of size 900)
for (i in 1:10){
  index <- counter + c(1:100) # the row numbers that will be in the test set
  test.list[[i]] <- cred[index, ] # the test set number i
  train.list[[i]] <- cred[-index, ] # the train set number i
  counter <- counter + 100
}

For example, test.list[[1]] is a data set containing the first 100 rows of our data, and train.list[[1]] contains the 900 remaining rows.

# test.list[[1]] %>% kable(caption="Our first test set named test.list[[1]]") %>% 
#   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
#   scroll_box(width = "100%", height = "200px")
# 
# 
# train.list[[1]] %>% kable(caption="Train set associated with test.list[[1]]") %>% 
#   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
#   scroll_box(width = "100%", height = "200px")

CART Model

acc.cart.cv <- numeric(10)
for (i in 1:10){
  cart.cv <- rpart(response~., data=train.list[[i]], method="class") # original tree
  cp.pruned <- cart.cv$cptable[(min(which(cart.cv$cptable[,ncol(cart.cv$cptable) -1] < cart.cv$cptable[nrow(cart.cv$cptable), ncol(cart.cv$cptable)] + cart.cv$cptable[nrow(cart.cv$cptable), ncol(cart.cv$cptable) - 1]))),1] # CP value to prune at (1-SE rule)
  cart.pruned.cv <- prune(cart.cv, cp=cp.pruned) # the pruned tree for predictions
  cart.pred.cv <- predict(cart.pruned.cv, newdata=test.list[[i]], type="class") # making the predictions
  tab.cart.cv <- table(test.list[[i]]$response, cart.pred.cv) # the confusion matrix
  acc.cart.cv[i] <- accuracy(tab.cart.cv) # the final accuracy
}

Logistic Regression

acc.glm.cv <- numeric(10)
for (i in 1:10){
  glm.cv <- glm(response ~., data=train.list[[i]], family="binomial") # original logistic regression
  logit.step.cv <- step(glm.cv, direction="both") # stepwise selection for predictions 
  glm.pred.cv <- ifelse(
  predict(logit.step.cv, newdata=test.list[[i]], type="response") > 0.5, "good", "bad"
  ) # making the predictions
  tab.glm.cv <- table(test.list[[i]]$response, glm.pred.cv) # the confusion matrix
  acc.glm.cv[i] <- accuracy(tab.glm.cv) # the final accuracy
}
acc.cart.cv
##  [1] 0.79 0.74 0.77 0.82 0.74 0.82 0.74 0.78 0.81 0.79
mean(acc.cart.cv)
## [1] 0.78
sd(acc.cart.cv)
## [1] 0.03197221
acc.glm.cv # the vector of accuracies for the glm model
##  [1] 0.84 0.67 0.79 0.80 0.78 0.72 0.69 0.78 0.78 0.75
mean(acc.glm.cv) # the average accuracy
## [1] 0.76
sd(acc.glm.cv) # the standard deviation of the accuracy
## [1] 0.05249339
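As a possible extension of this comparison, the per-fold accuracies reported above could be compared with a paired t-test (same folds for both models), in the spirit of the 1-SE discussion earlier:

```r
acc.cart.cv <- c(0.79, 0.74, 0.77, 0.82, 0.74, 0.82, 0.74, 0.78, 0.81, 0.79)
acc.glm.cv  <- c(0.84, 0.67, 0.79, 0.80, 0.78, 0.72, 0.69, 0.78, 0.78, 0.75)
t.test(acc.cart.cv, acc.glm.cv, paired = TRUE)  # is the 0.02 mean difference significant?
```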